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Let there be given a contaminated list of n Revalued observa- 
tions coming from g different, normally distributed populations with 
a common covariance matrix. We compute the ML-estimator with 
respect to a certain statistical model with n — r outliers for the pa- 
rameters of the g populations; it detects outliers and simultaneously 
partitions their complement into g clusters. It turns out that the es- 
timator unites both the minimum-covariance-determinant rejection 
method and the well-known pooled determinant criterion of cluster 
analysis. We also propose an efficient algorithm for approximating 
this estimator and study its breakdown points for mean values and 
pooled SSP matrix. 



1. Introduction. The aim of cluster analysis is the partitioning of a data 
set into g disjoint subsets or clusters with common characteristics. Besides 
heuristics, there are important approaches to this problem based on statis- 
tical models, in particular, approaches by the ML and Bayesian paradigms. 
The latter offer several advantages. They allow one to compute the cluster 
criteria to be optimized and they yield algorithms that effectively, and some- 
times efficiently, reduce them; see Schroeder (1976). Finally, a model serves 
as a guide for the user in which cases to apply the method. 

This paper deals with statistical cluster analysis in the potential presence 
of contaminations. Statistical methods postulate that the data come from 
different statistical populations. After clustering, the elements of the clusters 
may be used in order to estimate the parameters of the underlying statistical 
laws. Since almost all real data contain outliers, for the method to be useful 
in practice one will have to allow that part of the data are contaminations 
or spurious elements. Accommodating or discarding them in a previous step 
is necessary for robustly estimating these parameters. 
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There are a great number of statistical techniques for the clustering prob- 
lem g > 2 in the absence of outliers. One distinguishes between mixture and 
classification models; for an overview see Hartigan (1975) and the recent 
review paper Fraley and Raftery (2002). Two (nowadays classical) statisti- 
cal partitioning methods are the trace and determinant criteria of cluster 
analysis [Friedman and Rubin (1967) and Scott and Symons (1971)]. In both 
criteria, the pooled within-groups sum of squares and products (SSP) matrix 
W of the clustering, see (2), plays a central role. These criteria postulate as 
estimators those partitions which minimize the trace and the determinant 
of W, respectively. Both methods are not only heuristically motivated: the 
resulting partitions are maximum likelihood estimators of well-defined sta- 
tistical models. Therefore, both methods perform well whenever the data set 
is a realization of random variables obeying the underlying statistical laws. 
The probabilistic model for which the trace criterion is optimal assumes that 
all populations are normally distributed with unknown mean vectors and the 
same spherical covariance matrix of unknown size. The determinant crite- 
rion retains the assumption on equality of the covariance matrices, but is less 
restrictive in dropping that on sphericity. As a consequence, the partition 
optimal for the determinant criterion is invariant not only w.r.t. location 
but also w.r.t. scale. 

In the presence of outliers and in the case g = 1, the problem reduces 
to outlier detection or robust parameter estimation and a great number of 
methods are available; for a good overview see Barnett and Lewis [(1994), 
Chapter 7]. In the case g > 2, mixture models with outliers have been well 
known for some time; see again Fraley and Raftery (2002). With the aim 
of robustifying the trace criterion, Cuesta-Albertos, Gordaliza and Matran 
(1997) introduced a trimmed version which they called impartial trimming: 
given a trimming level a € ]0, 1[ , find the subset of the data of size [n(l — ot) J 
which is optimal w.r.t. the trace criterion. They also studied its consistency. 
Later, Garcia-Escudero and Gordaliza (1999) showed robustness properties 
of the algorithm and, recently, Garcia-Escudero, Gordaliza and Matran (2003) 
presented a trimmed /c-means algorithm for approximating the minimum of 
the criterion. 

We propose first a statistical clustering model with outliers which we call 
the spurious- outliers model. The idea behind it is general enough to allow 
the derivation of robust clustering criteria with trimming under all kinds 
of distributional assumptions and cross-cluster constraints. In fact, in the 
case of normal distributions with equal and spherical covariance matrices, 
one recovers impartial trimming. Applying the method to equal but general 
covariance matrices with rejection of n — r elements, the ML-estimator leads 
us in Section 2 to a robust version of the pooled determinant criterion, the 
trimmed determinant criterion (TDC): choose a subset of size r from the n 
observations and partition it into g clusters so that the pooled SSP matrix 
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has minimum determinant. Not surprisingly, the maximum likelihood esti- 
mate of the mean vectors of the different underlying normal distributions are 
the sample mean vectors of the various clusters, whereas that of the common 
covariance matrix is the pooled SSP matrix divided by r. In the case r = n, 
the TDC simplifies to the classical determinant criterion of cluster analysis. 

The number r of regular objects of the model becomes a parameter of 
the proposed algorithm, the number of retained elements. It turns out that 
the estimated means and pooled covariance matrix are fairly insensitive to 
the choice of this number as long as it is not chosen too large. Moreover, 
we propose in Section 5 a way of estimating r by a method akin to a x 2 
goodness-of-fit test: run the algorithm several times with various values for 
r and choose the one for which the output best fits the theoretical tail 
probabilities. This rule may be satisfactory, so much the more as there is 
no rigorous and unified concept of "outlier," let alone a formal definition; 
see, for example, Barnett and Lewis (1994) or the introductory discussion 
in Ritter and Gallegos (1997). 

Apart from this parameter, we do not address the question of model find- 
ing. Normality of the population distributions, the commonness of the covari- 
ance matrix and the number of clusters are assumed as a priori given. This 
may yield criticism. However, it is straightforward to carry over the method 
to other location and scale models, for example, elliptical distributions with a 
given radial behavior. But the efficiency of the algorithm depends on that of 
the ML-estimator of the population parameters and one reason for the pop- 
ularity of the normal model is the fact that ML-estimation of its parameters 
essentially reduces to summation. We could also dispense with commonness 
of the covariance matrix. However, the present model should be preferred in 
situations where each class arises from noisy versions of g prototypes and the 
noise affects each prototype in the same way. Examples are the classification 
of phonemes in speech recognition and the chromosome classification prob- 
lem. In the former case, the prototypes are the phonemes pronounced by a 
pure speaker and in the latter, they are clean images of the chromosomes of 
the different biological classes of an organism. In both cases, outliers play 
an important role; see, for example, Ritter and Gallegos (1997). Moreover, 
different covariance matrices would require estimation of more parameters 
and might need more observations than are actually available. Estimation 
of the number of clusters is an important issue that would go beyond the 
scope of this paper. Just as there is no clear definition of outlier, there is 
none of "cluster" either. Nevertheless, both are useful concepts. More gen- 
eral distributions, cross-cluster constraints and estimation of the number of 
clusters in a Bayesian framework will be the subject matter of a forthcoming 
communication. 

Minimizing the TDC requires computing a subset of size r of the n ob- 
servations and its subsequent partitioning into g clusters (one or several 
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clusters produced by our algorithm may be empty); we call such a partition 
together with the subset a configuration. As in the case of the classical deter- 
minant criterion, its computation is infeasible except for small data sets and 
approximation algorithms which are desirable. In Section 3 we formulate a 
reduction step that, starting from an arbitrary configuration, yields another 
configuration with lower or equal determinant of the corresponding pooled 
SSP matrix; it is based on the Mahalanobis distance. Iterative application 
of reduction steps until convergence and multistart optimization yield an 
efficient approximation to the required minimum. 

A measure of robustness of an estimator is its breakdown value or break- 
down point: the minimum fraction of bad outliers needed to make it suc- 
cumb. The asymptotic breakdown value is its limit as the number of observa- 
tions increases to infinity. Estimators with zero asymptotic breakdown value 
lack robustness. This paper would be incomplete if it did not contain a word 
about this topic and, in Section 4 we compute the breakdown values of the 
TDC for the mean vectors and for the pooled SSP matrix. It turns out that 
the asymptotic breakdown value of the SSP matrix is positive. Mean values, 
too, are robust w.r.t. data sets that meet a certain condition of cluster sep- 
aration to be specified in Section 4.2. Both facts plead for robustness of the 
TDC. 

In Section 5 we offer a few simulation studies in order to assess the per- 
formance of the proposed algorithm. The error rates obtained compare fa- 
vorably even to recent studies without outliers; see Coleman and Woodruff 
(2000). 

1.1. General notation and preliminaries. Given g > 1 elements Zj, 1 < j < g, 
of a set F, z{ stands for the (/-tuple (zi, . . . , z g ) G F 9 . We write A > (A > 0) 
in order to indicate that a symmetric matrix A G M dxd , d > 1, is positive 
(semi-) definite and we denote trace and determinant of A by tr A and det A, 
respectively. The (i-dimensional identity matrix is denoted 1^. The symbol 
Nd(/i,, V) denotes both the d-variate normal distribution with mean vector 
/x G M rf and covariance matrix V G M. dxd and its Lebesgue density function. 
Nd(fi, V)(x) denotes the value of this density at x G R rf . The sum of squares 
and products matrix (SSP matrix) We of a finite, nonempty subset E <Z~M. d 
with mean is the matrix 

(1) W E = ^(x-m^Xx-msf. 

We next recall some definitions and notation in the theory of cluster anal- 
ysis. Let D = x™ = (xi,X2, . . . ,x n ) be a list of n observations G M. d . We will 
often identify the observation x$ with the corresponding index i G \..n and 
a subset of the list with the corresponding subset of l..n. Given a finite 
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set E, the notation ( ) stands for the system of all subsets of E of size 
r 

r. Let g > 1 be the number of clusters. If R is a nonempty, finite subset of 
l..n, C g {R) denotes the set of all partitions TZ = {Ri, R2, ■ ■ ■ , R g } of R into g 
clusters (or subsets), that is, the set of all configurations over the set R. Let 
TZ = {Ri, i?2> • • • j Rg} £ C g (R) be such a configuration. We will often identify 
Rj with its index j and TZ with the integral interval l..g. For a nonempty 
cluster Rj, let hir- be its sample mean vector while, for all empty clusters 
Rj , vciRj is some vector in M. d given a priori. We write m-ji := (m^ , . . . , m_R 9 ) 
and mn(j) = m^. . The pooled or within- groups sum of squares and prod- 
ucts matrix (SSP matrix) Wtj of a configuration 7£ is the sum of the SSP 
matrices (1) of all clusters, 

(2) W TC = £ £ (x - m«,)(x - m flj .) T . 

3=1 xeKj 

2. The spurious-outliers model and its ML-estimator. Let r < n be the 

assumed number of regular observations. Both the number g of clusters and 
r are input parameters of the present clustering problem (concerning the 
choice of r, see Section 5, in particular, Table 1). 

2.1. The spurious- outliers model. This section extends the usual statis- 
tical clustering setup; see Mardia, Kent and Bibby [(1979), Section 13.2], 
combining it with Mathar's [(1981), Section 5.2] outlier model. 

Let (<7^)^,g* be some family of p.d.f.'s on M. d . The parameter set of our 
statistical model is 



(3) 6: = 



U C 9 (R) 



x (m d y x {v g R dxd \ v > 0} x ^ n . 



The first factor of G stands for the unknown configuration, the next two for 
the unknown parameters of the g underlying normally distributed statistical 
populations which generate the regular observations. Finally, the last factor 
of represents the unknown statistical laws that generate the outliers. Let 
Xi, i G l..n, be n independent, Revalued random variables and let their 
p.d.f.'s conditional on the parameter = (TZ, /if, V, tp™), TZ = {Ri, ■ ■ ■ , Rg} G 
C g (R) be given by 

g = f N d {fjLj,V), ieRj, 

JXi W, i$R- 

The observations Xj are realizations of these random variables. Since both 
regular observations and outliers come from Lebesgue-continuous popula- 
tions, it is natural to assume that the realizations are in general position 
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(any d + 1 elements are affinely independent). If we additionally require 
r > gd, then the pigeon hole principle ensures that at least one cluster con- 
tains d + 1 or more elements, which implies > for all configurations 
K. 

At the expense of the stronger condition 2r — n > gd instead of r > gd, 
the condition "data set in general position" could be relaxed to "r elements 
of the data set in general position." This, too, would guarantee nonsingu- 
larity of all pooled SSP matrices W-r.. This modification would allow the 
n — r outliers to possess any pattern, for example, a regular one. It would, 
however, exclude examples with many outliers from the beginning, such as 
in Fraley and Raftery [(2002), Figure 7]. We, therefore, stick to the former 
conditions. The user may want to screen the data set for affine dependen- 
cies in a preprocessing step, at least if dimension is not high. (Otherwise, 
one may run the algorithm removing affine dependencies as soon as a sin- 
gular SSP matrix is detected.) Since the regular populations are assumed to 
be Lebesgue continuous (even normal), such dependencies are the clearest 
indications of outliers. 

The likelihood function of the spurious-outliers model for the data x™ is 



(4) L x n(ft,/z?,V,^) 



n n JVdC^vxxi 



The maximum likelihood estimate of 1Z, /if, V, tp™ is any element of the pa- 
rameter set which maximizes (4). The following proposition shows that 
the ML-estimator exists and has a simple representation if the outlier model 
and the data meet a certain condition (5). We were led to this condition and 
the method of proof of the following proposition by a similar condition ap- 
pearing in Ritter and Gallegos (2002) in a different statistical context. The 
corollary following the proposition exposes an outlier model for which this 
condition is independent of data. 



Proposition 2.1 (ML-estimator). 

(a) //, for each subset T C l..n of size n — r, the function IlieT 5V>i( Xi ) 
possesses a maximum w.r.t. (?/>j) £ \E' T ' , then the maximum likelihood esti- 
mator of the spurious- outliers model exists. 

Assume, in addition, 

(5) argmin detW^C argmax max ^(xj). 

Ke{J Re ^c 3 (R) ne{J Re ^c g (R) Mi m 

(Note that the "argmax" on the right-hand side of the inclusion exists and 
that the product depends only on the choice of the outliers.) Then: 
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(b) The MLE of the configuration TZ ( for the given statistical model ) is 
determined by the TDC 



(TDC) min detW-fc = min min detW^I. 

Ke\J R n.. n ,c g (R) ^ fl G ( 1 - r ")^ec 9 (R) J 

(c) If we denote it by H* (if it is not unique, choose one), then the MLE 
of (/x l5 . . . , /n ) is (m-R*(l), . . . ,m-ji*(g)) and that of the common covariance 
matrix V is - W^* . 

Proof. By the assumptions and the discussion at the beginning of this 
section, we have W-^. > for any TZ £ \J Re ^i..»^C g (R). Therefore, classical 

normal distribution theory [see Mardia, Kent and Bibby (1979), pages 103- 
105] shows 

9 

max max ]J ]J N d(Vj,V)(xi) 

**1 j=li€Rj 

= max(det27rV)- r / 2 IJexp-l tr( v -i J- (x* - m^(j))(x; - m n (j)) J 
j=l V i€Rj 

(6) 

= max(det 2vrV)- r/2 exp - \ tr(V _1 W n ) 

= J fT(det2vrW^)" r/2 , 

where K is a constant independent of TZ. The last equality is a direct applica- 
tion of Mardia, Kent and Bibby [(1979), Theorem 4.2.1, page 104]. Finiteness 
of \J Re n -n\ C g (R), (6), (4) and the hypothesis of (a) together imply 



max 
11 



<) 



max max ]J ]J N d (nj, V) (x, ) max g ih (x; ) 

. **i j=lieRj ^ i£R 



(7) 

= max L x n(^,/Ltf,V,V'i), 

which proves part (a). 

Now, by hypothesis (5), any configuration TZ that maximizes the first 
factor in brackets in (7) also maximizes the second one. The maximum is, 
therefore, attained at the minimum of detW-^ over all configurations TZ. 
□ 



In order to minimize the criterion (TDC), one has to determine a sub- 
set R C \..n of r elements, the set of estimated regular observations and a 
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partition 1Z of it in g clusters such that has minimum determinant. 
Since 

(6) = f[ I] N d (mn(j),-W n )(^), 

this may be restated as follows: in order to find 1Z, compute the subset of r 
observations which is optimally explained by g normal populations with a 
common sample covariance. 

Condition (5) is not very restrictive. It is satisfied if max^, gy,(xj) exists for 
all i and does not depend on i. Two special, opposite cases are the following. 

Corollary 2.1. The conclusions of Proposition 2.1 hold under each of 
the following two conditions: 

(a) ^ = WL d and gy,(x) = g(x — ip) is a location model with a density func- 
tion g having a maximum [e.g., g = Nd(0,ld) or g = j^TB) '^ b > w ^ ere B is 
some region about the origin]. 

(b) ^ is singleton and g^ is constant on a region that contains all data. 

Proof. In case (a), let M = maxg. Clearly, from the assumption, 
max Yl 94 H (xj) = J[ max g fa - V>i) = M n ~ r 

does not depend on 1Z and Condition (5) of Proposition 2.1 is satisfied. The 
same is obviously true in case (b). □ 

The corollary shows that the parameter set \& may be of any size. If it 
contains just one element, then all outliers belong to one population. The 
model also allows that each outlier comes from its own population. This 
particularity justifies the adjective "spurious." 

Criterion (TDC) is optimal (in the maximum likelihood sense) for the 
spurious-outliers model 2.1 if condition (5) is satisfied, in particular, if the 
outliers are generated from Corollary 2.1(a) or (b). Therefore, it performs 
well whenever the data set x" is a realization of random variables meeting 
this condition. However, it is also a plausible descriptive tool per se. We 
formulate the case of one cluster (robust parameter estimation) as a second 
corollary of Proposition 2.1. A similar statement for normally distributed 
outliers appears in Pesch (2000) where Mathar's outlier model is already 
used. 

Corollary 2.2. Assume that the data x" satisfies (5) [e.g., that the 
family (<7^)^ g igd is a location model]. Then Rousseeuw' 's MCD is the maxi- 
mum likelihood estimator for the spurious- outliers model 2.1 with g = l. 
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Remarks, (a) Classical determinant criterion. For r = n (the pure clus- 
tering situation), criterion (TDC) reduces to the classical determinant crite- 
rion of cluster analysis [Friedman and Rubin (1967) and Scott and Symons 
(1971)]. 

(b) Mixture model vs. classification model. There exist two approaches 
to classical (nonrobust) model-based clustering: mixture modeling and the 
classification approach. It is well known that both are related; see Mar- 
dia, Kent and Bibby [(1979), remark (4), page 365] and Fraley and Raftery 
(2002). The same can be said about mixture modeling with outliers and the 
present robust classification model 2.1. Indeed, define the parameter set of 
the mixture model with outliers as 




where (tti, . . . ,TT g ) are the mixing parameters and Gi is defined in (3). Let 
Yi,...,Y n be i.i.d., 0.. g- valued random variables whose common distribution 
under the parameter 6 = (irf,^, V,^") £ Qm is given by P \Y\ = j] = ttj, 
j £ l..g, and P e \Y\ = 0] = 1 — ^. Furthermore, let Xij, i £ l..n, j £ 0-.g, be 
n ■ (g + 1) independent, M d -valued random variables, independent of Y™ and 
with p.d.f. conditional on 9, 

= (N d ( f i j ,\), j€l..g, 
W, 3 = 0. 

The formula of total probability shows that the Revalued random variables 
X\ Yj , . . . , X n y n are independent and that the conditional p.d.f. of -2Qyj, 
i £ l..n, is given by the mixture 

(8) f XiY . (x) =E7r,iV d ( Mj) V)(x) + (l - -W(x) 

with mixing parameters iij. The data x™ is now a realization of the ran- 
dom variables -X"i : y 15 . . . ,X nj y n . Note that 1 — ^ is the prior probability of 
occurrence of an outlier. Hence, in this model the configuration 1Z is an 
unobservable random variable. In the special case of Corollary 2.1(b), one 
finds the mixture model appearing in equation (11) of Fraley and Raftery 
(2002). 

The aim of the statistical clustering model 2.1 and the mixture model is 
(robust) clustering and estimation of the means of all subpopulations and of 
their common covariance matrix. Whereas, in doing so, the former estimates 
(besides these parameters) the optimal configuration 1Z* , the latter estimates 
the probabilities of the observations to come from the different clusters. 
In this sense, both models pursue the same aim. In the clustering model, 
the prior information of the existence of n — r outliers is expressed by the 
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constraint #C/2 = n — r. The mixture model describes this fact by setting the 
probability of occurrence of a contamination to 1 — ^. Furthermore, once the 
ML-estimates (7rf,/tf,V) of the mixture (8) are known, we can regard each 
distribution Nd{fij,~V) as indicating a separate group, and individuals are 
then assigned to clusters by Bayes' allocation rule: assign the ith observation, 
i G l..n, to the class j that maximizes the posterior density irjNdfaj, V)(xj). 
The (estimated) set of regular observations R consists of the r observations 
with the r largest maxima. The corresponding optimal partition is defined 
by the class assignments of the elements of R. Similarly, once the optimal 
configuration 1Z* of the clustering model is known, we can estimate the 
mixing parameters by the cluster sizes divided by n. 

(c) Unequal cluster sizes. Being an ML-estimator, the pooled determi- 
nant criterion can be interpreted as a maximum a posteriori estimator for 
mixtures with equal mixing parameters. Therefore, it favors equal cluster 
sizes, although it can deal with small deviations from the ideal situation. 
For unequal mixing parameters, an entropy correction has to be added to 
the criterion; see Symons (1981). The same remark applies to the trimmed 
version. We will deal with this topic (and the related question of the number 
of clusters) in another communication. 

3. An efficient approximation algorithm. Minimizing the trimmed de- 
terminant criterion requires the computation of a subset of size r out of the 
n observations and its subsequent partitioning into g clusters. This task is 
infeasible, except for small data sets, and an efficient approximation algo- 
rithm is desirable. In this section we develop such a procedure. It is iterative 
and adapts the idea of minimal distance partition, now classical in cluster 
analysis [see Schroeder (1976) for a general version], to the case with outliers. 
In the classical case (without trimming), one reduces the sum of the squared 
Mahalanobis distances w.r.t. W-r. for a fixed "current" configuration 1Z by 
reassigning single observations to cluster centers with smaller Mahalanobis 
distances w.r.t. W-^. Moreover, one shows that this reduction also reduces 
the determinant of the SSP matrix. We prove below that the same idea 
can be applied also to the trimmed determinant criterion; this extension is, 
however, not straightforward. The following theorem gives rise to the basic 
reduction step of our algorithm. 

Theorem 3.1. Let 1Z and TZ new be two configurations over r -element 
subsets R, .Knew Q l..n, respectively, such that 

9 

J2 ( x i- m ^(i)) T W^ 1 (x i -m 7 j(j)) 

(9) 
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<EE( X »- m^fw^x, - m n (j)). 

j=l i&Rj 

(a) We have det W7^. ncw < det W-r, with equality if and only if W7^ new = 
W n . 

(b) Let us put m Kncvj (j) := m n (j) for all j £ l..g such that R llcw j = 0. 
If there is equality in (a), then we have also m7^ new = m^. 

Proof. At the beginning of the proof of Proposition 2.1, we have al- 
ready used the fact that the determinant of W-ji is a constant multiple of a 
negative power of the product 

n n iVd(m W o').-w w )(xi). 

j=UeRj v r j 

Claim (a) will, therefore, follow if we prove 

n n ^( m ^ew(i),;w 7 ^ new )(x i )>niv d (m TC (i) iw^w 

j=lieRnev,,j V 7 3=1 V 1 

Now, passing to likelihoods, we have 

9 ( 1 



j=iieii 



new . j 



>n n L x,(m K (j)> K 



i=iiefi n 



CW,J 



(10) =exp^ ^ ^(m^.iWtt) 



j i£R n 



\ r 



n n ix,U(j) iw w j. 

j=ueR4 v 



In this chain the first inequality follows from ML-estimation and the second 
is just the assumption. This proves the first part of (a). 

If the two determinants are equal, so are both ends of the above chain. 
Equality in (10) follows and uniqueness of the ML-estimator implies W7^. new = 
and m^ ncw (j) = m. n (j) for all j G l..g such that R ncWtj / 0. This con- 
cludes the proofs of parts (a) and (b). □ 
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Let TZ be any configuration. With the squared Mahalanobis distances 

(11) d n (i,j) 2 :={^ l -m n (j)) T W^ 1 (^ l -m n (j)), i G l..n, j G l..g. 
Inequality (9) may be rewritten as 

(12) E E m^) 2 <EE^(^) 2 - 

j = li€-Rncw,j j = li£Rj 

Theorem 3.1 is the basis of the following building block for our algorithm: 
starting from a configuration TZ, look for another configuration TZ new such 
that the corresponding sum of distance squares w.r.t. the given TZ is smaller 
than the current one; see (12). The theorem assures that this new configu- 
ration is a better approximation to the minimum of the TDC. 
Plainly, a configuration TZ ncw that minimizes the sum 

(13) E E Mhj) 2 

over all configurations {Pi,...,P g } £C g (P), PG ( L r n ), satisfies (12). For- 
tunately, computing this minimum is very simple: it is sufficient to assign 
each observation i to a cluster j G l..g which minimizes the distance square 
dfc(i,j) 2 with respect to the fixed configuration TZ. Let us call each cluster 
j G l..g that minimizes dn{i,j) 2 optimal for the ith observation, i G l..n, 
with respect to the given partition TZ. Since we must restrict our choice to 
r observations, the optimal ones are those with the r smallest distances to 
their optimal clusters. These ideas are made precise in the following corollary 
of Theorem 3.1. 

Corollary 3.1. Let TZ be a configuration and let R n cw be a subset of 
l..n consisting of r observations with the smallest Mahalanobis distances to 
their optimal clusters w.r.t. TZ (in general, R new is unique). Let TZ new be the 
partition of R ncw obtained by assigning each i G R n cw to its optimal cluster 
w.r.t. TZ. Then: 

(a) TZ ncw minimizes the objective function (13) over the set of all config- 
urations {P u . . . , P g } G C g (P), P G ( L " ) . 

(b) The conclusions of Theorem 3.1 hold. 

In the case of one class, g = 1, Corollary 3.1(b) is the basis of Rousseeuw 
and Van Driessen's C-step [Rousseeuw and Van Driessen (1999), Theo- 
rem 1]. For n = r, (the noncontaminated situation) Corollary 3.1(b) reduces 
to Spath [(1985), Theorem 3.5]. We next formulate the reduction step in 
algorithmic terms. 
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3.1. The reduction step. Input: A configuration TZ together with its mean 
vectors and its SSP matrix W^; 

Output: A configuration TZ ncw such that det W^ ncw < det W^. 

(i) Compute the distance squares dn(i,j) 2 , i G l..n, j G l..g, defined 
in (11). 

(ii) For each i G l..n, determine an optimal cluster ji G TZ, that is, ji G 
argmin jgl g d n (i,j) 2 . 

(iii) Determine a permutation k: l..n — ► l..n that satisfies 

(14) j K (i)) 2 < dw(/e(2), j K (2)) 2 < • • < d n (Hi(n), j K{n) f. 

(iv) Put -R n ew = { K (l) 5 • • • > K ( r )} and, for each j G l..<?, put R new j = {i G 
l-^b'/t(t) = il- Finally, let 7£ ne w := {-Rnew,i; • • • > ^ncw, 9 }- 

3.2. Iteration and discussion. Now, starting from an initial configura- 
tion T^o and iterating reduction steps, we obtain a sequence of configurations 
(TZk)k>o such that det ~Wn k+1 < det for all A;. Since there are only a 
finite number of configurations, this iterative process must become station- 
ary after a finite number of steps, say L, with det W-ft L = detW^ (> 0). 
By Corollary 3.1, we have W-^ L+1 = and m7£ i+1 = m^. Therefore, 
dn L {hj) = dn L+1 {i ,j), i £ l-w, j G and, if TZl is unique, a further re- 
duction step yields a configuration with sum (13) and the TDC unchanged. 
If TZl is not unique, then a further step may improve the TDC, but not (13). 
(An example of nonuniqueness in the complex plane is x/; = e l?rfc / 4 , k G 1..8, 
r = 4, g = 1 and TZ = {xi,x 3 ,x 5 ,x 7 }.) 

The configuration TZl is one approximation to the minimum. Now, mul- 
tistart optimization is applied to the foregoing iterative process; the limit 
configuration with the least value of the determinant of the corresponding 
SSP matrix is the final approximation to the minimum. 

If a configuration TZ is a global minimum of the TDC, then a reduction 
step with input and yields an equivalent configuration. 

3.3. The initial configuration. We indicate two methods for generating 
random initial configurations. Both are natural extensions of the ones pro- 
posed by Rousseeuw and Van Driessen [(1999), Section 4.1]: 

(a) Draw a random configuration consisting of nonempty clusters. 

(b) Choose at random a subset of l..n with at least gd + 1 elements. 
Construct a random partition T> of the subset in g clusters and compute 
its mean vectors and its SSP matrix Wp. Iterate a reduction step to 
obtain an initial configuration TZq. 

We conclude this section with a result concerning some geometrical prop- 
erties of any limit configuration of our algorithm. This result extends Corol- 
lary 1 of Rousseeuw and Van Driessen (1999) to robust clustering and the 
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well-known geometric separation property of discriminant analysis, see Mar- 
dia, Kent and Bibby [(1979), Theorem 11.2.1], to clustering in the presence 
of outliers. 

Corollary 3.2. Let 1Z = {R\, . . . , R g } be a limit configuration of the 
reduction step iteration, for example, the optimal configuration. 

(a) Each nonempty cluster Rj, j G l..g, is separated from the estimated 
set C U?=i Rj °f outliers by an ellipsoid. 

(b) Two different, nonempty clusters Rj and Ri are separated by the hy- 
perplane hji(xi) = 0, where hji :M. d — > R is the linear form 

hfliy) := 2[y - ±(m R (i) + m^(/))] T W^(m^(i) - m K (l)), y G R d . 
The observations i in cluster Rj are those satisfying hji(xi) > 0. 

Proof. The application of a reduction step to the configuration 1Z yields 
1Z itself as possible output. Thus, the set of regular observations R = Uf=i Rj 
may be written as {k(1), k(2), . . . , /c(r)}, where k : l..n — ► l..n is a permuta- 
tion satisfying (14), whereas the set of outliers is given by {n(r + 1), k(t + 
2), . . . , n(n)}. In order to prove part (a), let j £ l..g such that Rj ^ 0. All ob- 
servations i G i?j satisfy d-ji{i,j) 2 < maxi< m ,< r d-R,(K(m), j) 2 =: Kj, whereas 
alH ^ J? satisfy d n (i,ii) 2 > Kj [even dn(i,ji) 2 > dn(^(r),j K ^) 2 ]. The ellip- 
soid 

^ = {x£ M d |(x - m^(j)) T W^(x - m^(j)) < Kj} 

contains Rj and Ci? is contained in the closure of CEj . 

For two observations i\ and «2 i n t ne jth an d in the Zth cluster, respec- 
tively, we have dn(h,j) 2 < dn(h,l) 2 and cfo^,/) 2 < dn(i-i,j) 2 - Part (b) 
now follows from standard matrix operations. □ 

4. The breakdown values. 

4.1. Preliminaries. Besides the asymptotic breakdown value of an esti- 
mator Hampel (1968, 1971), there exists also a finite-sample version, Hodges 
(1967) and Donoho and Huber (1983). Loosely speaking, the latter measures 
the minimum fraction of bad outliers that can completely spoil the estimate. 
More precisely, let G be a locally compact parameter space, for example, 
the intersection of an open and a closed subset of some Euclidean space 
and consider an estimator 5: A — > 0. Here, A C M. n ' d is the system of all 
data sets admissible for S ("in general position" in our case). We say that 
(x' x , . . . , xJJ G A is an m-modification, m < n, of a data set (xi , . . . , x n ) G A 
if it arises from (xi, . . . ,x n ) by replacing m observations Xj with arbitrary 
elements x£ G M. d in an admissible way. An estimator 5 : M. n ' d — > Q breaks 
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down under m replacements with a data set (xi, . . . ,x n ) £ A if the set of 
estimates 

{5{x.'i, . . . , x^Kx^, . . . ,x n ) is an m-modification of (xi, . . . ,x n )} C O 

is not relatively compact in O. The individual breakdown point for x" is 
defined as 

— {5(M)\M} is not relatively compact in >; 
n J 

here M runs over all m-modifications of x™. It is the minimum fraction of 
replacements in x™ that may cause 5 to break down. 

Depending on a specific data set, this is not a useful notion per se. There- 
fore, Donoho and Huber define a value that we call the universal breakdown 
value (5(5) of 6: it is the minimum relative amount of replacements that 
causes 5 to break down with some data set x" 6 A: 

(15) 0(6) = min /?(<*, x?). 

According to this definition, the estimator breaks down at the first integer 
m for which there exists some x™ such that the estimate becomes arbitrarily 
bad for some suitable modification M. 

The universal breakdown value is the minimal individual one; it depends 
on the estimator and its parameters alone, not on data. It is pessimistic in 
considering the worst case and modifications of this notion are conceivable. 
One may argue that the existence of a single data set x" and possibly very 
special bad modifications M may not suffice to indicate lack of robustness 
of an estimator. A more relaxed definition would require the criterion to 
break down for sufficiently many data sets x™. We will introduce such a 
modification in Section 4.2. Another less stringent definition would require 
all components of the estimate to break down [such as all means in Definition 
4.1(a)]. 

The present task is, among other things, estimating mean vectors and an 
SSP matrix by means of the TDC. In the first case, = M. 9 ' d and in the 
second, is the set of all positive-definite d by d matrices, an open subset 
of C^ 1 ) -dimensional Euclidean space. 

Our definitions and analyses need the following facts. If A < B, then 
tr A < trB by linearity and det A < detB (see Lemma A. 2). Let A m i n (A) 
(A max (A)) be the least (largest) eigenvalue of a matrix A > 0. Then 

A m i n (A) = min x T Ax < min x T ~Bx = A m i n (B) 

||x||=l ||x||=l 

and, similarly, 

Amax(A) < A 

max 

(B). 
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Let E C F be nonempty, finite subsets of M. d and let and m F be their 
means. Then their SSP matrices We and We [in the simple sense (1)] satisfy 
W E = W E (m E ) < W E (rn F ) < W F (m F ) = W F . Hence, also trW E < trW>, 
detW E < detW F , X m in(W E ) < A min (VF F ) and X max (W E ) < X max (W F ). 

In the present situation, corruption of the estimates is reflected by an 
arbitrarily large value of at least one sample mean or by an arbitrarily large 
or small eigenvalue of the pooled SSP matrix of the optimal configuration; 
see Donoho and Huber (1983). Transferring definition (15) to the present 
situation, we obtain the following. 

Definition 4.1. Let n,r,g,d be such that n > r > gd + 1 as before. 
Given a data set M C JR n ' d of observations in general position, let Ai* denote 
its optimal configuration w.r.t. the TDC. 

(a) The universal breakdown value of the TDC for the mean vectors is 



Anean (n, r, g) = min min \ — 

x™ \<m<n y n 



sup max ||m_A4*(j j =oo 
M J'si-s 



here x" runs over all data sets in general position and M over all m- 
modifications of x™ in general position. 

(b) The universal breakdown value of the TDC for the SSP matrix is 

, , 1 



J m 






sup max 


\ n 


M V 



M * h \ — rw — ~' =0 ° 



where x" and M are as stated in (a) 



We first show that, in general, the universal breakdown value of the TDC 
w.r.t. the mean vectors is low. We need two lemmas; the first — a geometrical 
interpretation of the SSP matrix — is of general interest and the second is 
combinatorial and of a technical nature. The parallelepiped spanned by k+ 1 
points yo,...,yit Effi rf , k < d, is the subset 



{k 
yo + ^Ai(yi-y o )|0 
i=i 



<A,;<1 C 



Its d-dimensional volume is independent of the order of the points y^. 

Lemma 4.1. Let E = {x , . . . ,x d } QR d . We have the equality det W F 
volume 2 P{E). 

PROOF. Let m = ^-I]f=o x j- Putting 

A=( 1 "' 1 )e# 1)x(W) . 
x - m • • x d - m 
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we obtain the claim from 
volume 2 P(x , . . . ,x d ) 

= det 2 (xi -x ,...,x d -x ) 

= det 2 A = det A det A T = det 

Lemma 4.2. Let g > 2, p> 2, q> g — 2 and r = p + g be natural numbers 
and let 

F = {xi,... i Xp}U{y 1 ,y 2 }U{z 1 ,..., z g } 

with pairwise disjoint elements Xi, yk and Z\. Any partition 1Z of a subset of 
F of size r in g classes is either of the form 

TZ* = {{x\, . . . , x p }, {yi, y 2 }, g — 2 one-point classes from the z{' s} 

or possesses a class C which contains some pair {xi, yk} or some pair {zi, u], 
u^z h 

Proof. Let # r x, # r y and j^ r z be the numbers of Xj's, y^s and zi's, 
respectively, that make up the configuration TZ. By assumption, # r x + # r y + 
# r z = r and, hence, 

# r z = r- # r x - # r y >r-p-2 = g-2. 

The claim being trivial if # r z > g, we consider the three remaining cases 
# r z = g, g — 1 and g — 2 separately. Now, j^ r z = g implies j^ T x + # r y = 
r — g = p > 2; therefore, if there are no two z^s in one class, then one class 
must contain some Z\ together with an X{ or a y\.. If j^ r z = g — 1, then 
# r x + j^ r y = r — g + 1 = p + 1; since p > 2, at least one X{ and one yk must 
belong to the configuration. A simple counting argument shows the claim in 
this case. Finally, if = g — 2, then j^ r x + j^ T y = r — g + 2 = p + 2, that 
is, all Xj's and all j/^'s belong to the configuration. If all zis form one-point 
classes, then the x^s and y^'s must share the remaining two classes. If they 
are separated, then 1Z = TZ*. In the opposite case, some class contains both 
an Xi and a yk- □ 

For / y £ l d , the rank-one matrix yy T has the simple eigenvalue ||y|| 2 . 
Therefore, det (Id + yy T ) = 1 + ||y|| 2 , y G R d . Hence, if A G R dxd is positive 
definite, then we have 

(16)iet(A + yy T ) = det VX{I d + A~ 1/ ' 2 yy T A~ 1 / 2 )\/A = (1 + y T A~V) det A, 



1 

x - m 



det W E ■ 



1 

x d - m 



(\ (x -m) T 
\l (x d -m) T 
□ 
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an equality that we will repeatedly use in the sequel. 

In the following theorem we assume that, in the case of ties, an optimal 
solution is chosen for which the trace of the between-groups SSP matrix is 
minimum. This applies, in particular, to one-point clusters since these can 
be exchanged with discarded observations without any change of cost. 

Theorem 4.1 (Universal breakdown point of the TDC for the means). 

(a) If n > r + 1 and r > gd + 2, then the means remain bounded by a 
constant that depends only on the data as one observation is arbitrarily re- 
placed. 

(b) If g > 2 and r > g + 2 (besides the standard assumption r > gd + 1), 
then there is a data set such that one mean breaks down if two particular 
observations are suitably replaced. 

(c) If g > 2, n > r + 1, and r>gd + 2, then mean (n,r,g) = \. 

Proof, (a) It suffices to show that an optimal configuration 1Z* discards 
a remote replacement. The mean vectors of all clusters of TZ* will then 
remain within the convex hull of the data x™. Arguing by contradiction, 
let us assume that there is an optimal configuration TZ* which contains the 
replacement y in a cluster C E TZ* . If the replacement is far away, then this 
cluster must contain at least one other point since, otherwise, it would be 
exchanged with a discarded original point by the convention agreed upon just 
before the theorem. This point must, of course, be an original observation. 
It follows that u : = y — mc — > oo as y — ► oo. Since r > gd + 2, any subset of 
r elements, in particular the union of the members of TZ*, contains at least 
r — 1 >gd + 1 original points. Therefore, one cluster, C±, contains a subset 
E consisting of d+ 1 original points. The affine span of E is the whole space 
and, whether C\ = C or not, we have 

W n * > W E {m Cl ) + (y* - mc)(y* - m c ) T >W E + uu T . 

Therefore, by (16), 

detW^. >det(WE + uu T ) = (l + \\Wp 1/2 u\\ 2 )detW E — ► oo. 

u — >oo 

This contradicts the fact that the maximum cost of any configuration that 
discards the replacement is finite. 

(b) Let us construct a data set D = {xi, . . . , x r _ g , Wi,W2, zi, . . . , z n _ r+5 _2}. 
First note that r — g > d + 1 (distinguish between d = 1 and d>2). Hence, 
we may choose r — g elements {xi, . . . ,x r _ 9 } = F in general position with 
mean zero and SSP matrix 1^. 

We next use induction to construct the points z±, . . . , z n _ r+g _2 if n — 
r + g > 2. Suppose that z\, . . . ,zi have already been constructed for some 
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1,0 <l <n — r + g — 2. Let Ti be the system of all hyperplanes H spanned by 
d of the points in Mi = F U {zi, . . . ,zi} each. Since the set of all directions 
in W 1 parallel to some H G J-\ is a {d— l)-dimensional subspace of R d , there 
is a direction not parallel to any of the hyperplanes. By running in such a 
direction, we find z; + i so far from each hyperplane H that 

(17) volume P(z /+ i, ui, ... ,u d ) > y / 2(d+T) 
for all {ui, . . . , Ud} G ( M d l ) and such that 

(18) (1 + i(z,+i - u) t We\z 1+1 - u)) det W fi > 2 

for all u G M; and all E G (^i^)- After all Xj's and Zj's have been con- 
structed, the two points wi,W2 are chosen in the centered unit ball so that 
D is in general position. Irrespective of the optimal configuration, the g 
estimated means are within the convex hull of D. 

Now, mimicking the construction of one zj, we replace the two points wi 
and W2 with a twin pair yi ^ y2 such that ||y2 — yi|| < 1, 

(19) volume P{E) > ^2(d+l) 

for all sets E G ((^\{ Wl '™2-^ u { yi ' y2 }) that contain at least one y^ except for 
E = {yi,y2} if d=l, and such that 

(20) det^(l + i(y fc -u) T ^ 1 (y fc -u))>2, A; = 1,2, 

for all u G D \ {yi, y 2 } and all E G ( F }^ } ) . 

We claim that the optimal configuration is TZ* = {F, {yi , y2}, {zi}, . . . , {z g - 
Indeed, by (16), its cost is 

detW n * = det(W F + ±(y 2 - yi)(y 2 - yi) T ) 

= det(I d + i(y 2 - yi)(y 2 - yi) T ) 

= i + |l|y 2 -yi|| 2 <|. 

Moreover, Lemma 4.2 tells us that any clustering 1Z not equivalent with 
1Z* (equivalent in the sense that some z^'s are exchanged) possesses some 
cluster C (choose it of maximum size) containing either some pair {xj, y^} or 
some zi together with any other element. Let us denote these two elements 
by a and b. If C is of size at least d + 1, then we choose a (d+ l)-element 
subset E C C containing {a, b} and estimate 

det W n > det W c > det W E > volume 2 P(E) > 2 

d+1 

according to Lemma 4.1, (17) and (19). 
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Otherwise, we have d > #C > 2 and there exists a cluster R of size > d+ 1 
which contains no pair {xj,yj,} and no zj. We have R^C and from d > 2, it 
follows that RC F. Choosing a (d + l)-element subset E C i?, we use (18), 
(20) and (16) to estimate 

det W-^ > det(WE + W {aM ) = det We det (1 + ±(b - a) T W / £ 1 (b - a)) > 2. 

In order to conclude the proof of part (b), it is sufficient to observe that the 
norm of the mean vector of the cluster {yi,y2} may be arbitrarily large. 
Part (c) follows immediately from (a) and (b). □ 

4.2. Restricted breakdown point and the separation property. Theorem 4.1 
says that the asymptotic universal breakdown value for the means is zero; 
this is a negative result and somewhat unsatisfactory in the framework of a 
trimming algorithm. One reason is the strength of the universal breakdown 
value. We may rescue the situation by introducing a relaxed version of it, 
the restricted breakdown value (3(5, JC) w.r.t. a subclass K, C A of data sets 
admissible for 5. It lies between the individual and the universal breakdown 
values, Section 4.1, and we define it as the minimum fraction of replacements 
that cause 5 to break down with some x™ 



The universal breakdown value is just (5(5, A). The restricted breakdown 
value, too, is a characteristic of an estimator. It provides information on the 
structure that a data set must have so that the estimator still acts reasonably 
after contamination. 

Let us now compute the restricted breakdown value of (TDC) for the mean 
values w.r.t. a certain class of data sets that we describe next. It is necessary 
to first introduce some notation. Let V = {Pi, . . . , P g } be a partition of some 
data set S, let £j be the set of all mean values of nonempty subsets of Pj, 
1 < j < <7, let 5^ be the set of all subconfigurations of V comprising at most 
r elements and possessing at least one cluster of size >d+l, and let W v be 
the set of all pooled SSP matrices generated by elements of . Given g > 2 



and u > 1, we define k 3jU = |~ max { 2r n '^ g j L ^ +1 ' n (> d) and /C fl)W as 



the system of all d-dimensional data sets S of length n in general position 
that have the separation property 

S possesses a partition V in g subsets of sizes at least u (< \n/g\) such that 



min (3(5, X 



(21)min min (m^ — m,) T W 1 (m^ — m,) > 2 • 
WeW v m,e£? 

J J 



ma,x W€VV v det W 



min c g / Pj \ det Wc ' 





ROBUST CLUSTER ANALYSIS 



21 



Both sides of this estimate are invariant with respect to location and scale 
and their quotient describes a measure of validity of the partition V, a com- 
bination of cluster compactness (det W) and cluster separation (the Maha- 
lanobis distance squared). A great number of such indices are widely in use 
for assessing the quality of a partition; see Bezdek, Keller, Krisnapuram and 
Pal (1999). 

A data set satisfying condition (21) possesses a marked cluster structure. 
Note that the left-hand side of (21) increases as the different clusters in 
V are moved away from each other, whereas the right-hand side remains 
unchanged under this operation. It is, therefore, easy to construct examples 
of data sets that satisfy the separation property. 

Note that the classes IC gtU decrease as u increases. The SSP matrix Wp 
is larger than all matrices in w.r.t. the positive semi-definite ordering; 
therefore, substituting min mj . jInfc (mfc — rrij)- r Wp 1 (m/ c — mj) for the left- 
hand side of (21) and det W-p for max WeW p det W defines a narrower class. 
It is easier to verify this condition than (21). 

Lemma 4.3. Let S G JC 9t i (with associated partition V) and let 1Z be a 
partition of some finite subset R C M. d such that: 

(i) some cluster R^^TZ contains elements of at least two different Pj 7 s; 

(ii) there are a cluster Ri G 1Z and some j G l..g such that jj=(Ri nP 3 -) > 

Then we have det > max WgW p det W. 

Proof. Without loss of generality, the two P,'s appearing in (i) are 
Pi and P2. We first consider the case k = l. Putting a,j = #(i?& flPj), we use 
Lemma A. 3 to estimate 

> W Rk = - m R k )(x - m Rk f 

x£R k 

9 

= J2 J2 ( x ~ m Rk)( x - m R k f 
j=ixeR k nPj 

= ± WRtnPi+ E ^(^-^(o^-W 

j=l l<j<h<g W k 

= W Rfen p + jrjrfaRknPi ~ ™R k nP h ){m RknP] - m Rk np h f; 

i<j<h<g ^ Uk 

here we have abbreviated R^nV = (R^HPi, . . . , R^f] P g ) . Applying Lemma A. 1(b) 
with A = W RknV and y jh := J^-(m RknP . - m RknPh ), l<j<h<g, we 
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infer 



det W w > ( 1 + J2 ^ L ( m R k nP j ~ m RknPh ) T W R l np (m RknP] - m Rk nP h 

\ \<j<h<Q W k 



<j<h<g 
x detW Rfc np 

l<j<h<g 



i<i<^<9 

x WflJn^C^nP, - m flfen pjdet W fifc np 

l<j<h<g 
x detW^np; 

the last inequality follows from Lemma A. 4 and (i). Since n V € 5^ by 
(ii) and since m RknPh G 1 < h < 2, we may apply the lower bound (21) 
to the last line above to obtain 

detW^> ma *W6^detW detW > max detW> 
mm g / \ det Wc weW 

1 < 3 <S 

If k^l, we start with 

> + 

>W W + E ^(rn RknP3 - mRknP J( mRknP3 - mRknP y. 

l<j<h<g I? k 

The remainder of the proof is similar, as before. □ 

If a data set meets the separation property, then (TDC) is much more 
robust than predicted by Theorem 4.1. 

Theorem 4.2 (Restricted breakdown point of the TDC for the means). 
Let g>2, let r > (g — l)gd, and let u> n — r be an integer. 

(a) The restricted breakdown value of (TDC) for the mean values w.r.t. 
fCg. u satisfies /3 mean («,r,5,/C fl)U ) > ± min{n - r + l,r - (g - l)gd,r + u- n}. 

(b) If 

(i) 2r — n > (g — l)gd and 

(ii) u > 2(n — r), 

then mea ,n(n,r,g,}Cg )U ) = ±(n-r + 1). 
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Proof, (a) Let S G fcg,u with partition V and let M be any set obtained 
from S by modifying at most p = min{n — r,r — (g — l)gd — l,r + u — n— 1} 
elements. Our proof proceeds in several steps. 

(a) Any configuration 1Z in M has at least one cluster with d+ 1 original 
observations: 

Indeed, by definition of p, R = {J1Z has at least r — p > (g — l)gd > gd 
original observations and the claim follows from the pigeon hole principle. 

Let 1Z now be the optimal configuration of M. We will show that the 
mean values of all clusters of 1Z are bounded by a number that depends 
solely on the original data S. 

(/?) det W-R. is bounded by a number that depends only on S: 

In fact, we have 

(22) detW n < max detW. 

Indeed, let R' C M consist of r original points and let 1Z' = R' n V = 
{R! n Pi, . . . , R' n P g ); then TZ' G S v , W B - G W v and det < det W n , by 
optimality. 

(7) If Rj contains d + 1 or more original observations, then its mean rrij 
is bounded by a number that depends only on S: 
To this end, define 

Amax('S') := max{A|A eigenvalue of Wc, CCS 1 and #C > d + 1}, 
&min(5) :=min{detW c |CCS',#C , >d+l}. 

These quantities are bounded above and below (away from 0) and depend 
only on S. Now, by Steiner's identity, 

> (x - m R . ) (x - m R . ) T 

xeRjns 

= W Rjn s + #(Rj n S)(m Rj - m RjnS )(m Rj - m RjnS f 

rp 

> W Rj ns + (m R . - m R . n s)(m R . - m RjnS ) . 
Hence, by (16) and the assumption made on Rj, 

det W n > det W Rj ns(l + - m/ ?j . n5 ) T W^ 1 n5 (m Rj - m RjnS )) 

and the claim now follows from (/?). 

((5) If -Rj contains between one and d original observations, then its mean 
rrij is bounded by a number that depends only on S: 
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By (a), there is a cluster k 7^ j containing d+ \ original observations. We 
have 

> W Rk + W Rj 

rp 

> W Rk n S + W Rjn s + #(Rj n S)(m R . - m RjnS )(m Rj - m RjnS ) 

> W Rk ns + (m Rj - m RjnS ){m Rj - m RjnS ) 
by assumption on Rj] hence, 

det > detW Rk ns{l + (ra Rj - m Rj ns) W RknS (m Rj - m RjnS )) 

and the proof terminates as that of (7). 

In view of (7) and (6), the proof of part (a) will be finished if we show 
that 

(e) each Rj contains at least one original point. 

Assume, on the contrary, that 1Z contains some cluster that consists solely 
of replacements. We show that, as a consequence, R and 1Z must satisfy the 
hypotheses of Lemma 4.3. By definition of p, R has at least r — (r + u — n — 
1) = ti + 1 — u original elements; that is, J2j=i r\ Pj > n — u. Taking into 
account that each Pj has at least u elements, a simple counting argument 
shows that each of the g sets Pj intersects R. By assumption, there is some 
Rh omitted by all Pj's and the pigeon hole principle shows Lemma 4.3(i). 
Again by assumption, the original observations in R are distributed over at 
most (g — l)g (disjoint) sets of the form Ri n Pj] the definition of k 9jU and 
another application of the pigeon hole principle show that some set Ri n Pj 
contains at least k 9tU original observations; this is Lemma 4.3(h). 

The conclusion of Lemma 4.3 contradicts (22), which completes the proof 
of part (a). 

(b) The individual breakdown value of (TDC) for the mean vectors w.r.t. 
any admissible data set x" is < n ~^ +1 . Indeed, let M be a set obtained from 
x" by modifying at least n — r + 1 of its elements. Then each subset of M 
of size r contains at least r — (n — (n — r + 1)) = 1 replacements. Part (b) 
now follows from (a), (i) and (ii). □ 

In the case g = 1, (TDC) reduces to Rousseeuw's MCD; see Corollary 2.2. 
Rousseeuw (1985) proves that the asymptotic breakdown value of MCD 
with r = [(1 — a)n\ is a if a < 0.5; see also Lopuhaa and Rousseeuw (1991), 
where the analogous estimator MVE is treated in more detail. This is in 
harmony with the foregoing theorem, although it is not a corollary of it: the 
separation property and part (e) of its proof are not applicable if g = 1. 

Theorem 4.2(b) asserts that the algorithm can withstand exactly the num- 
ber of outliers generated by the model 2.1 if the hypotheses of (b) are sat- 
isfied. 
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Example 4.1. By way of discussing Theorems 4.1, 4.2 and the separa- 
tion property, it is interesting to take a look at an instructive example. Let us 
consider the one-dimensional data set xi,..., x\q shown in Figure 1 with gap 
a > 1. Its "natural" partition has the g = 2 clusters R\ = {xi, . . . ,x§} and 
i?2 = {x&, • • • , a?io}) an d the means are and a + 4. It is reasonable to choose 
this partition for V and u = 5. The assumptions of both theorems are met 
with the parameter r = 8 (i.e., the algorithm discards two observations). The 
first theorem says that (TDC) will resist one arbitrary outlier for all a > 1, 
whereas the second promises that it will tolerate even two such outliers if 
a > 20. There is actually a transition from the breakdown value 0.2 to 0.3 at a 
much smaller value of a. Let us compute this critical value. A decisive pair of 
replacements is (xf, xg). As we replace these two observations by very large, 
close numbers x' 7 , x§ with SSP e, two configurations compete for optimality: 
the configurations R[ = {x±, . . . , xq}, R' 2 = {x' 7 , x' 8 } (xg and x\o removed, one 
mean breaks down) and R" = {x±, . . . , X5}, R% = {xq, xg, x\o} (replacements 
removed, means are and a + tt). The first has SSP 10 + |(a + 2) 2 + e, 

which is smaller than that of the second, if a < — 2 ~ 1.225. Hence, 
this is the critical parameter that separates the breakdown values 0.2 and 
0.3. This indicates that (TDC) is actually more robust than predicted by 
Theorem 4.2, let alone Theorem 4.1. 

We next show that, if n > r and if r/n is large enough, the TDC is robust 
w.r.t. the SSP matrix. We actually show that it breaks down simultaneously 
for each data set as the fraction of bad outliers slightly exceeds 1 — r/n. 

Theorem 4.3 (Universal breakdown point of the TDC for the pooled 
SSP matrix). 

(a) Suppose that 2r > n + g(d + 1) . If at most n — r + g — 1 points of the 
data set D are replaced in an arbitrary way, then the eigenvalues of the SSP 

1 1 

X\ Xi £3 £4 £5 Xq Xf Xg £9 £10 
1 1 1 1 1 ■ ■ - 1 1 1 1 1 

-2-1 () 1 2 a+2 a+3 a+4 o+5 a+G 

Fig. 1. ID data set, replacements x' 7 , x' s ; breakdown at 2 replacements ifa< 1.225 and 
at3ifa> 1.225. 
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matrix of any admissible configuration remain bounded away from zero by a 
constant that depends only on D and d. 

(b) Suppose that 2r>n + g(d + 1) . If at most n — r + g — 1 points of D 
are replaced in an arbitrary way, then the eigenvalues of the SSP matrix of 
the optimal configuration remain bounded by a constant that depends only 
on D and d. 

(c) Given t>0, n — r + g elements of any D may be replaced so that the 
largest eigenvalue of the SSP matrix exceeds t. 

(d) If 2r > n + g(d + 1), then /3 SS P (n, r, g) = . 

Proof. We begin with two remarks, (i) Let D be any subset of M. d of 
size n in general position and let We be the SSP matrix of a subset E C D. 
Since D is in general position, the number 

a= min \ m m(W E ) 

is strictly positive and depends on D and d, only. 

(ii) Let M be any set obtained from D by modifying at most n — r + g — 1 
elements and let i? be any subset of M consisting of r elements. If 2r > n + 
g(d+ 1) , then contains at least r — (n — r + g — 1) =2r — n — g + l>gd+l 
original observations in D. By the pigeon hole principle, any clustering of R 
in g parts has at least one member C that contains d + 1 such points. We 
now prove (a)-(d). 

(a) The least eigenvalue of the SSP matrix of C in (ii) is > a and, by 
Section 4.1 the same is true for the SSP matrix of any admissible clustering 
of the modified set M. 

(b) Since r — g+1 points in D remain unchanged by the replacement, the 
modified set M has an admissible configuration A4 with one cluster consist- 
ing of r — g + 1 original data and g — 1 one-point clusters. Hence, the SSP 
matrix of the optimal configuration M* of M cannot have a determinant 
larger than that of A4, that is, 

detWyw* <detW^. 

The SSP matrix W_a^ is that of the r — g + 1 original points and, hence, 
depends only on D. Therefore, its determinant is bounded by a constant 7 
that again depends only on D. By (ii), at least one cluster of M* contains at 
least d + 1 original points. The claim, therefore, follows from the estimates 

A m ax(W A 4*)a d " 1 < A max (W A1 *)Amin(W A1 *)^ 1 < det W.m* < det < 7. 

(c) Modify Dbj >n — r + g replacements that are at a distance > 2t from 
each original observation and from each other to obtain a set M. Let M* 
be its optimal clustering. Clearly, any subset of r elements of M contains at 
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least g replacements. Moreover, there is a cluster C, j^C > 2, that contains 
at least one replacement. Indeed, if no cluster contains two replacements, 
then each contains exactly one and, since r > gd + 1, one cluster contains at 
least two elements. Now, if x is a replacement and y another element in C, 
then the trace of the SSP matrix of C is at least 

x + y\/ x + y\ T / x + y\/ x + y x 



|x — yll 2 

> t. 



2 

Therefore, trW^* > tr Wc > t (see 4.1) and one eigenvalue must exceed 
t/d. 

Claim (d) follows from (a)-(c). □ 

It may be astonishing that the TDC should take g — 1 replacements even 
if r = n. However, isolated replacements may be treated as one-point clusters 
in the optimal configuration; in this case, the number of clusters formed by 
the original data is reduced. Replacements that huddle together may form 
one cluster. In both cases, the SSP matrix is not completely destroyed. 

5. Simulation studies. In order to assess the performance of the proposed 
algorithm, we have implemented it as a C++ program for various dimen- 
sions, sizes, numbers and positions of clusters, as well as numbers of outliers. 
The first simulation study illustrates how, by varying the input parameter 
r (the assumed number of regular observations) of our algorithm, one can 
roughly control the amount of contaminations contained in the data set. 

The symbol E M. d , i £ l..d, stands for the ith unit vector. As usual, the 
symbol Xd( a ) denotes the o-quantile of the x 2 -distribution with d degrees of 
freedom. We consider, in dimension d = 8, the 2d normally distributed pop- 
ulations Nd(nj,\~), j £ 1..2d, with common covariance matrix V, diagonal 
with entries (1.0,1.2,1.4,1.6,1.8,2.0,2.2,9.0) and means 



(23) £i ? 



/ V(( J + l)/2,(j + l)/2) X 2 (a) 

7. e (j+l)/2> J odd > 



IV(j/2,j/2)x 2 d (a 



■e j/2 , j even, 



a G {0.95,0.99,0.999,0.999999}. That is, the means of the various clusters 
lie on the 2d axial directions of M d ; those of the jth and (j + l)st clusters, 
j odd, lie on the same axis but in opposite directions viewed from the origin. 
The means (23) assure that the squared Mahalanobis distance of two cluster 
centers on the same axis is 2\\ (a), whereas it is Xd( a ) m ^ ne opposite case. 
The values a = 0.95 and 0.99 give rise to heavily and moderately overlapped 
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clusters, while a = 0.999 and 0.999999 mean better and good separation, re- 
spectively. We generate 100 observations from each cluster. Thus, we obtain 
a total of r = 200 c? (regular) observations. Additionally, in our first essay, we 
contaminate the data with 22 d outliers arranged in shells around the cluster 
centers. The square of the Mahalanobis distance from each contamination 
to the closest cluster center is Xd((3), (3 £ {0.999,0.999999}; see Figure 2. 

Since we consider four a's and two @'s, these specifications define eight 
different cases. To each of them we apply the algorithm described in Sec- 
tion 3 four times, namely, with the a priori numbers of regular observations 
r = n, 0.95ra, 0.9n and 0.85n. More precisely, we apply the multistart method 
with up to 2000 random initial configurations based on the method 3.3(a) 
and iterate reduction steps until convergence is reached. The 32 rows in 
Table 1 show, for each choice of a, f3 and r, the fractions of estimated reg- 
ular observations whose squared Mahalanobis distances to their estimated 
cluster centers (w.r.t. the estimated common covariance matrix) are larger 
than a given percentile x^t), 7 G {0.95,0.975,0.99,0.999}. In the rows cor- 
responding to the correct fraction of outliers (~ 10%), these are expected 
to match the correct tail probabilities 1 — 7 shown on the top of Table 1. 
This heuristic (akin to a x 2 goodness-of-fit test) for estimating the number 
of outliers is suggested by a similar heuristic applied in robust discriminant 
analysis by Gather and Kale [(1988), Section 3] and Ritter and Gallegos 
(1997). The method slightly underestimates the number of outliers. We also 
compare the theoretical populations with those defined by the estimated 
clusters. The Bhattacharyya distance between two normal distributions is 

cWt (N d (Mi , V x ) , N d (/i 2 , V 2 ) ) 



VdetVxdetVa / 1 nT/x/.v-Wc 
= 1 "V det((V 1+ V 2 )/2) eXP l-4 (/l2 -^ ) (Vl + V2) { ^-^ 

a number in the unit interval. A measure for the quality of the estimates is 
the minimum over all matchings a 6 S2d between estimated and real classes 
of the maximum Bhattacharyya distance over the 2d matched pairs: 

min max c? B hatt ( N d (m a(j) , - W j , N d {^i j , V] 

The results are shown in Table 2 for d = 2, 4, 8. For d = 2 and a = 0.95, the 
original clusters are not recovered by the algorithm since the 400 regular data 
points are too homogeneous; there is no reasonable matching. We tested also 
scenarios with diffuse outliers generated from N d (fi, v-Id), d = 2, 4, 8, 
and v > 16. The case /x = 0, v = 16 is the most demanding since the variance 
is already quite close to that of the last coordinate. Even in this case, only 
about 10% of the rejected elements are (extreme) regular observations and 
the generated clusters are well rediscovered. 
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Fig. 2. Visualization of four clusters in two dimensions. The Mahalanobis distance be- 
tween each pair of the four cluster centers depends on the value a as specified by (23) . The 
outliers are arranged in shells. The inner and outer shells correspond to f3 = 0.999 and 
f3 — 0.999999, respectively, where /3 defines the ellipsoids of equal concentration on which 
the contaminations lie. See also the text, (a) a — 0.99, (b) a — 0.999. 
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Finally, the number L of reduction steps until convergence in one iteration 
is about 15 for d = 2 and 22 for d = 4,8 with standard deviations about 
7. One reduction step takes no longer than 0.004, 0.01 and 0.06 seconds, 
respectively, on a 1 GHz processor. These figures are essentially independent 
of the trimming parameter r of the algorithm. 

We do not contend that the proposed algorithm responds to each cluster- 
ing situation. In fact, the presented model is meant as a possible component 
for outlier handling in a comprehensive clustering strategy. One of the main 
purposes of the paper is a contribution to computing breakdown values. Nev- 
ertheless, in our experience, the algorithm works well in situations where the 
model assumptions are satisfied: clusters of about the same shape, scale and 
size, and outliers sufficiently scattered in space (not concentrated close to 
one or a few points). 

APPENDIX: TECHNICAL PRELIMINARIES 

In this Appendix we prepare some tools for the proofs of the theorems. 
Some of the statements are of interest on their own. 

Lemma A.l. Let d>2 and let A £ R dxd be positive definite. 

(a) For any positive semi-definite matrix C E M dxd , we have 

det(A + C) > (l + tr(A _1 C))detA + <letC. 

(b) If yi, . . . , jjk G M. d , then we have 

det ( A + E Vhvdj > y 1 + E Vh^VhJ det A - 

PROOF, (a) From A + C = A x / 2 (I d + A^CA" 1 / 2 ^ 1 / 2 , we infer 

(24) det(A + C) = det(Lj + A^^CA' 1 / 2 ) det A. 

If \\, X2, ■ ■ ■ , Xd are the eigenvalues of A _1 / 2 CA -1 / 2 , then the eigenvalues 
of Id + A-^CA- 1 / 2 are 1 + Ai, . . . , 1 + Ad and the claim follows from (24) 
and 

det (I d + A-^CA" 1 / 3 ) = f[ (1 + Xj) > 1 + Xj + ]J 

j=l j=l j=l 

= 1 + tr(A- 1 / 2 CA- 1 / 2 ) + det A^CA" 1 / 2 . 
Part (b) is an immediate consequence of (a). □ 

Lemma A. 2. < A < B implies det A < detB. If B > 0, then there is 
equality if and only if A = B . 
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Proof. If d = 1, nothing has to be shown. Otherwise, the first claim is 
plain if A is singular. If A is positive definite, then the claims follow from 
Lemma A. 1(a) with C = B - A. □ 

Lemma A. 3. Let x™ be a Euclidean data set and let {Px,...,P g } be a 
partition of \..n with cardinalities a\,...,a g . Moreover, let m be the mean 
o/x" and let uij be the mean of (xj)j G p. (arbitrary ifaj = 0). Then 

9 

Y X!( X i- m )( X i- m ) r 

9 l 

= Wp j + - a j a h(mj ~ m h )(m i - m h ) T . 

3=1 71 l<j<h<g 

Proof. Expanding the left-hand side, we obtain 

9 

Y 5Z( x *- m )( x *- m ) T 

j=iiePj 



E x ' x 



T T 
n ■ mm 



i=l 

9 9 

= Yj Yj ( XiX i" ~~ m j m J) + Y, a j m j m J ~ 71 ' mm2 

j=liePj j=l 

9 9 
= Y / + ^2 a j ' m j m J — n ■ mm 7 . 

j=i j=i 

On the other hand, 
1 



77 

!<3<h<9 



Y aja h (mj - m h )(m.j - m h ) T 

<9 

— Y a j a h( m j ~ ™- ft )(mj - m h ) T 

3,h 

- Y "j'lhinjm'j - - Y aja h m 3 W h 



T 

n " J n *— * 

3,h 3,h 



Y "i m i n, j - - ( Y «i m i J ( Y 



n 

j \ j 



a 3 m J 



Y a i m i 



m T T 
, m„ — n ■ mm . 



□ 
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Lemma A. 4. Let g > 2 and c > 2. The minimum of the sum of products 
J2i<j<i< g a j a i taken over all g -tuples (01,02, • • • ,a g ) of real numbers 01,02 > 
1, 03, . . . , a g > stic/i i/tai Sf=i a,j = c is c — 1. It is assumed exactly at the 
tuples (1, c — 1, 0, . . . , 0) and (c — 1, 1,0, ... ,0). 



Proof. We proceed by induction on g. If g = 2, we have the one- 
dimensional problem of optimizing the function ai 1— ► oi(c — 01), restricted 
to the interval [l,c — 1]. It plainly attains its minimum at the two endpoints 
1 and c — 1 . 

Assume now that the assertion has already been proved up to g and let us 
prove it for g + 1. Prom J2i<j<i< g +i a-ja-i = a g+1 (c - a g+ i) + Ei<,<k 9 <HCtj 
and the induction hypothesis, we infer by means of the principle of dynamic 
optimization, 

min ajai 



j=1 a.j=c l<j<l<g+l 



^9+1 
ai>l,a2>l 



% n J a 9+l( C - a 9+l)+^ g min a i a l) 



ai>l,a2>l 



min (a g+1 (c-a g+ i) + c-a g+1 -l) 

a 9 +i6[0,c-2] 

min (a g+1 (c - a g+i - 1) + c - 1) . 

a s+ ie[0,c-2] 



This is a one-dimensional optimization problem for a^+i G [0,c — 2]. The 
minimizer is and the minimum is c — 1 . This concludes the proof. □ 
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Table 1 

Fraction of the estimated regular observations whose squared Mahalanobis distances from 
their relative estimated population centers are greater than Xs{l)> d = 8. An estimate of the 
amount of outliers is the fraction fr r which the values shown in the corresponding row 

match 

best the theoretical tail probabilities for the \ 2 -distribution shown on top 
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u 


fl 0Q4 


n o^ 




o 

u 


n QQQ 

u.yyy 


o QQQ 

u.yyy 


0.05 


0.068 


0.042 


0.012 





0.10 


0.039 


0.018 


0.005 









0.15 


0.003 


















0.075 


0.035 


0.007 





0.999999 


0.999 


0.05 


0.068 


0.024 








0.10 


0.045 


0.023 


0.004 









0.15 


0.011 


















0.100 


0.094 


0.086 


0.051 


0.95 


0.999999 


0.05 


0.066 


0.053 


0.045 


0.027 


0.10 


0.043 


0.018 


0.007 









0.15 


0.003 


















0.104 


0.093 


0.085 


0.039 


0.99 


0.999999 


0.05 


0.069 


0.055 


0.044 


0.031 


0.10 


0.042 


0.018 


0.006 









0.15 


0.003 


















0.102 


0.099 


0.092 


0.045 


0.999 


0.999999 


0.05 


0.068 


0.057 


0.050 


0.027 


0.10 


0.041 


0.019 


0.005 









0.15 


0.004 


















0.106 


0.098 


0.085 


0.023 


0.999999 


0.999999 


0.05 


0.076 


0.059 


0.049 


0.024 


0.10 


0.046 


0.023 


0.004 









0.15 


0.011 
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Table 2 

The maximal Bhattacharyya distances of the best 
matchings between estimates and theoretical populations 



dimension/number of clusters 



a-quantile 


/3-quantile 


2/4 


4/8 


8/16 


0.95 


0.99 

0.999999 




0.0685 
0.0689 


0.0789 
0.0556 


0.99 


0.999 
0.999999 


0.0386 
0.0356 


0.0340 
0.0246 


0.0291 
0.0297 


0.999 


0.999 
0.999999 


0.0257 
0.0111 


0.0165 
0.0155 


0.0265 
0.0265 


0.999999 


0.999 
0.999999 


0.0105 
0.0104 


0.0165 
0.0176 


0.0241 
0.0240 



